5 - PCA, Harmony Integration, and kNN Graphs

Author

CDN team

Published

December 9, 2024

Introduction

This vignette is a hands-on guide to understanding and applying principal component analysis (PCA), k-nearest neighbors (kNN) graph construction, and Harmony integration in the context of scRNAseq data analysis. PCA and Harmony are two common methods for creating a dimensionality-reduced “latent space” from scRNAseq cell x gene matrices. A “latent space” is any matrix of reduced size that can be calculated from the data while still preserving the relationships between samples of the original matrix. PCA creates its latent space by deriving the combinations of genes that describe the greatest variance across samples in the dataset. Harmony creates a latent space that reduces batch effects by shrinking the differences between batches separately within each cluster of the latent space. Finally, we calculate the kNN graph of relationships between cells from a latent space rather than from the raw matrix, because the cost of building the graph grows with the number of input features.
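To make the idea of a latent space concrete, here is a tiny base-R sketch (using prcomp for illustration, not the Seurat workflow used later in this vignette) that compresses a toy cell x gene matrix into a much smaller cell x PC matrix:

```r
# Toy illustration of a latent space: compress a 100-cell x 50-gene matrix
# into a 100 x 5 cell x PC matrix with PCA (base R, not Seurat).
set.seed(687)
mat <- matrix(rnorm(100 * 50), nrow = 100)   # pretend cell x gene matrix
pca <- prcomp(mat, center = TRUE)
latent <- pca$x[, 1:5]                       # keep only the first 5 PCs
dim(latent)                                  # 100 cells, now only 5 features
```

Downstream steps (like the kNN graph) can then work on 5 features per cell instead of 50.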

Vocabulary

Latent Space

Any matrix that can be calculated from a data matrix with a reduced size while still preserving the relationships between samples of the original matrix. Latent means “existing as potential”, as in “latent abilities” - so a latent space is a hidden set of dimensions in the data that could be useful in ways the raw data is not.

Feature

A variable in a sample x variable matrix. In a cell x gene matrix, the gene is the feature. In PCA, the principal components are the features.

Dimensionality Reduction

The computational process of reducing a large data matrix to fewer features. For example, in scRNAseq, PCA is the most popular dimensionality reduction technique. With it we can reduce our features from ~30,000 genes to ~30 principal components, reducing the computational workload of complex analysis tasks, like creating the nearest neighbors graph.

Embeddings

This is another term for a latent space; the two terms are most often interchangeable. “Embedding” emphasizes the placement of the cells in a different feature space.

Loadings

This is the term for the feature translation matrix: which genes contribute to which PC? The loadings matrix holds the weight of each gene in each PC. A negative value means that expression of the gene pushes a cell in the negative direction of the PC; a positive value pushes it in the positive direction.
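This relationship between data, loadings, and embeddings can be checked directly: the embeddings are the centered data multiplied by the loadings. A base-R sketch with prcomp (not Seurat's objects):

```r
# Embeddings = centered data %*% loadings: verify with prcomp on toy data.
set.seed(687)
expr <- matrix(rnorm(20 * 6), nrow = 20)         # toy cell x gene matrix
pca  <- prcomp(expr, center = TRUE)
centered <- scale(expr, center = TRUE, scale = FALSE)
scores   <- centered %*% pca$rotation            # rotation = the loadings matrix
all.equal(unname(scores), unname(pca$x))         # TRUE
# A gene with a positive loading on PC1 pushes cells with above-average
# expression of that gene toward the positive end of PC1, and vice versa.
```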

Additional Reading

Principal Component Analysis (PCA)

  • A short synopsis of PCA by Josh Starmer, aka StatQuest - This video uses scRNAseq data to explain some ideas behind PCA
  • Understanding PCA by Victor Powell - a visually interactive explanation of the algorithm behind PCA.
  • Jolliffe, I.T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. read here

Harmony Integration

k-Nearest Neighbors (kNN) Graphs

  • A nice introduction to kNN graphs can be found at kNN Visualization. We recommend this site in general for learning about algorithms in the context of single-cell RNA sequencing!

Key Takeaways

  • PCA and other latent-space algorithms are essential in the scRNAseq workflow for dimensionality reduction. PCA is especially useful because it helps to visualize and interpret major trends and variability in the data. PCA output:
    • a Cell x PC matrix; a new set of variables that combine the most covarying sets of genes in the dataset, for example M phase vs. G1 phase, Immune vs. Epithelial, Sick vs. Healthy.
    • a PC x Gene matrix; the loadings or weights of each gene’s importance in each PC
    • a Variance Explained vector; the standard deviation of the dataset captured by each PC (square it to get the variance)
  • kNN graphs define neighborhoods of cells based on similarity. This is the ultimate goal of scRNAseq data processing, because it allows us to identify clusters of cells. kNN output:
    • a Cell x Neighbor matrix; for each cell (rows), the indices of its nearest neighboring cells (columns)
  • Harmony integration minimizes batch effects by regularizing batches in any specified latent space. In benchmarking studies, it is often highlighted as the best integration method, balancing the removal of technical noise with the maintenance of known biological signals, like cell cycle. Like PCA, Harmony creates a latent space that can be visualized. Harmony output:
    • a matrix of harmony integrated PCs per cell
  • These methods are not only relevant in single-cell analysis but also applicable across various fields of data science and can be useful for analysis of any complex dataset.
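As a concrete picture of that kNN output, here is a minimal base-R sketch (not Seurat's fast approximate implementation) that builds the cell x neighbor index matrix from euclidean distances in a toy latent space:

```r
# Brute-force kNN: for each cell, find the indices of its k nearest neighbors.
set.seed(687)
emb <- matrix(rnorm(10 * 3), nrow = 10)   # 10 cells in a 3-dimensional latent space
k <- 4
d <- as.matrix(dist(emb))                 # all pairwise euclidean distances
nn_idx <- t(apply(d, 1, function(row) order(row)[1:k]))
dim(nn_idx)    # 10 cells x 4 neighbors
nn_idx[, 1]    # as in Seurat, each cell's nearest neighbor is itself
```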

Libraries

### Make sure all the packages are installed
if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")

if (!requireNamespace("harmony", quietly = TRUE))
    devtools::install_github("immunogenomics/harmony")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("DT", quietly = TRUE))
    install.packages("DT")    

if (!requireNamespace("scales", quietly = TRUE))
    install.packages("scales") 

if (!requireNamespace("ggraph", quietly = TRUE))
    install.packages("ggraph") 

if (!requireNamespace("tidygraph", quietly = TRUE))
    install.packages("tidygraph") 

if (!requireNamespace("ggforce", quietly = TRUE))
    install.packages("ggforce") 

if (!requireNamespace("ggalluvial", quietly = TRUE))
    install.packages("ggalluvial") 

if (!requireNamespace("corrplot", quietly = TRUE))
    install.packages("corrplot") 

if (!requireNamespace("stats", quietly = TRUE))
    install.packages("stats") 

if (!requireNamespace("fastDummies", quietly = TRUE))
    install.packages("fastDummies")

if (!requireNamespace("RColorBrewer", quietly = TRUE))
    install.packages("RColorBrewer")

if (!requireNamespace("colorRamps", quietly = TRUE))
    install.packages("colorRamps")


### Load all the necessary libraries
library(Seurat)
library(tidyverse)
library(devtools)
library(harmony)
library(colorBlindness)
library(DT)
library(scales)
library(ggraph)
library(tidygraph)
library(RColorBrewer)
library(colorRamps)
library(ggforce)
library(ggalluvial)
library(corrplot)
library(stats)
library(fastDummies)

set.seed(687)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from the cellxgene portal.

se <- readRDS("../data/Covid_Flu_Seurat_Object.rds")

Color palette

donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
    "#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C",
    "#FC4E2A", "#E31A1C", "#BD0026", "#800026")

names(donor_pal) <- c(
    "Normal 1", "Normal 2", "Normal 3", "Normal 4",
    "Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
    "nCoV 1", "nCoV 2", "nCoV 3_4", "nCoV 5",
    "nCoV 6", "nCoV 7_8", "nCoV 9_10", "nCoV 11"  
)

To get up to speed with the previous worksheets, process the data in the same way.

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(
        selection.method = "vst",
        nfeatures = 3000,
        verbose = FALSE) %>%
    ScaleData(verbose = FALSE, features = VariableFeatures(.))

Analysis

PCA

When we run PCA on our Seurat object using the Seurat package’s implementation, PCA is calculated on the scaled, highly variable genes. The resulting PCA latent space is placed in the reductions slot.

se <- se %>% 
    RunPCA(
        npcs = 30,
        ndims.print = 1:5,
        nfeatures.print = 10
    )

# This is where latent spaces, aka dimensional reductions are stored. If we call it directly, we can see some information about the PCA.
se@reductions$pca
A dimensional reduction object with key PC_ 
 Number of dimensions: 30 
 Number of cells: 59572 
 Projected dimensional reduction calculated:  FALSE 
 Jackstraw run: FALSE 
 Computed using assay: RNA 

Accessing PCA results

The Cell x PC latent space is stored in the cell.embeddings slot. The Gene x PC matrix of gene weights per PC is stored in the feature.loadings slot. The standard deviation of each PC (from which the variance explained is derived) is stored in the stdev slot.

DT::datatable(se@reductions$pca@cell.embeddings)
DT::datatable(se@reductions$pca@feature.loadings)
(se@reductions$pca@stdev) %>% head
[1] 12.976147  8.673263  6.949142  5.879680  4.845745  4.341204

Using PCA to understand variance in your data

By visualizing the top 2 principal components, we can start to understand the major axes of variance of our data. Seurat has built-in functions to graph the PCs and loadings

se@meta.data$nCount_RNA <- as.numeric(se@meta.data$nCount_RNA)
    
# FeaturePlot allows us to plot any numerical value on any latent space representation of the data. Here we can use it to color PCA by number of RNA counts per cell
nCount_RNA_pca <- FeaturePlot(
        se,
        reduction = "pca",
        features = 'nCount_RNA'
        )

nCount_RNA_pca

# DimPlot allows us to color the latent space based on categorical values per cell, like cell type or which sample the cell is from
celltype_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'Celltype'
        ) 

celltype_pca

# We can also plot further PCs by specifying the dimensions to plot
DimPlot(
    se,
    reduction = 'pca',
    group.by = 'Celltype',
    dims = c(3,4)
    )

donor_pca <- DimPlot(
        se,
        reduction = "pca",
        group.by = 'donor_id'
        ) + scale_color_manual(values = donor_pal)

donor_pca

Based on our DimPlot of the first 2 PCs, we can see the major axes of variation in the data are

  • PC1: myeloid (negative) vs. lymphoid (positive)

  • PC2: erythroid/platelet (positive) vs. other immune cells (negative)

By visualizing loadings of these PCs, we can see which genes are driving the differences between these groups of cells

VizDimLoadings(
    se, 
    dims = 1:2, 
    reduction = "pca",
    # By setting balanced = TRUE, we see top loadings in each direction of each PC
    balanced = TRUE)

And indeed, we can see that known myeloid genes S100A8, CST3, and LYZ are in the negative direction of PC_1, and platelet markers like PF4 are in the positive direction of PC_2.

What if I used 10, 30, 50, or even 100 PCs?

So far we haven’t even looked beyond PCs 1 and 2. The PCA algorithm finds components in order: PC_1 captures the most variance, and each subsequent PC captures the most remaining variance. If we recalculate PCA asking for only 10 PCs, PC_1 and PC_2 will be the same, except that sometimes their directions are flipped. The sign of each PC is arbitrary and can differ between runs of the algorithm.

If you asked for 100 PCs, you would see the same again for PCs 1 and 2.

The maximum number of PCs is min(nSamples - 1, nFeatures). Since we run PCA on only the highly variable genes, here that limit is the smaller of nCells - 1 and the number of variable genes. As we increase the number of PCs, we increase the total variance explained by the PCA output matrices.
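We can verify this limit with a quick base-R check using prcomp on toy matrices:

```r
# With fewer samples than features, only nSamples - 1 PCs carry variance.
set.seed(687)
few_cells <- matrix(rnorm(5 * 10), nrow = 5)    # 5 "cells" x 10 "genes"
sum(prcomp(few_cells)$sdev > 1e-8)              # 4 = nCells - 1

# With fewer features than samples, the feature count is the limit.
few_genes <- matrix(rnorm(10 * 5), nrow = 10)   # 10 "cells" x 5 "genes"
sum(prcomp(few_genes)$sdev > 1e-8)              # 5 = nGenes
```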

se <- se %>% 
    RunPCA(
        npcs = 50,
        verbose = FALSE
    )

VizDimLoadings(
    se, 
    dims = c(1:2, 20:21, 40:41),
    reduction = "pca",
    # By setting balanced = TRUE, we see top loadings in each direction of each PC
    balanced = TRUE)

But what if we think we are missing out on some of the signal by using too few PCs? Seurat has two methods for deciding the number of PCs: the qualitative Elbow plot and the statistical JackStraw test.

Deciding the number of Principal Components

The most popular method, the Elbow plot, is the qualitative choice of nPCs based on diminishing returns in percent variance explained as nPCs increases.

By default, the ElbowPlot shows the standard deviation of each PC; the standard deviation is the square root of the variance. The total variance in the dataset is stored in se@reductions$pca@misc$total.variance - to obtain the percent variance explained by each PC, we square the stdev and divide it by the total variance.

ElbowPlot(se, ndims = 50)

varExplained <- data.frame(
    varExplained = se@reductions$pca@stdev^2 / se@reductions$pca@misc$total.variance,
    PC = seq_along(se@reductions$pca@stdev)
    )

ggplot(varExplained, 
       aes(x = PC, 
           y = varExplained)) + 
    geom_point() +
    theme_classic() +
    ylab('percent variance explained per PC')

For the purposes of the elbow plot, the stdev curve has roughly the same shape as the variance explained per PC, so either can be used to spot the elbow.

The Elbow plot is the simplest way to view diminishing returns in variance explained per PC.

We then simply cut off the nPCs at that point to prevent our PCA from becoming overly complex and overfitting the data: beyond the elbow, the PCs capture mostly per-cell noise rather than shared structure.

In standard data science workflows, PCs are included until some threshold of cumulative variance explained is reached, often 85%. But if we sum the variance explained by the PCs up to the cutoff we chose by ElbowPlot, we can see it’s only about 20%. This is because scRNAseq data are highly noisy, with a great deal of unique variance per cell. We would have to include hundreds of PCs to reach 85% variance explained.

And, keep in mind this total variance is calculated based only on the highly variable genes we chose.

varExplained <- data.frame(
    varExplained = se@reductions$pca@stdev^2 / se@reductions$pca@misc$total.variance,
    PC = seq_along(se@reductions$pca@stdev),
    data = 'full data')

varExplained$cumulativeSum <- cumsum(varExplained$varExplained)

ggplot(varExplained, 
       aes(x = PC, y = cumulativeSum, color = data)) +
    geom_point() +
    scale_y_continuous(limits = c(0,1)) +
    theme_classic() +
    ylab('cumulative percent variance explained per PC') +
    ggtitle('How much of the total variance in the dataset\ndid we explain with the PCs chosen by our ElbowPlot()?')

Practicing PCA on a subset of the data

For those curious just how many PCs we would need to reach that threshold of 85% variance explained, let's subset our dataset to make the computation and visualization reasonable.

# select a random sample of 1000 cells
seSub <- subset(se, downsample = 1000) %>%
        NormalizeData() %>%
        FindVariableFeatures() %>%
        ScaleData() %>%
        RunPCA(npcs = 999)

varExplainedsub <- data.frame(
    varExplained = seSub@reductions$pca@stdev^2 / seSub@reductions$pca@misc$total.variance,
    PC = seq_along(seSub@reductions$pca@stdev),
    data = 'subsampled real data')

varExplainedsub$cumulativeSum <- cumsum(varExplainedsub$varExplained)

ggplot(varExplainedsub, 
       aes(x = PC, y = cumulativeSum, color = data)) +
    geom_point() +
    theme_classic() +
    geom_hline(yintercept = 0.85, color = 'red') +
    ylab('cumulative percent variance explained per PC') +
    ggtitle('If we really wanted to include 85% of the variance in our data\n how many PCs would we need?',subtitle = 'in a subset of 1000 cells')

kNN + SNN graphs

Because these data objects can be large, the nearest neighbors graphs are normally stored in the “graphs” slot of the Seurat object as sparse matrices, a compact form that is hard to read directly. We can instead store the graph as a set of readable matrices using the “return.neighbor” parameter:

se <- FindNeighbors(
            se, 
            # We must set return.neighbor = TRUE to be able to access these matrices.
            # I wasn't able to get the shared nearest neighbors graph to save as matrices, but the kNN appears.
            return.neighbor = TRUE,
            k.param = 30
)
# Now we can print which cells are closest to one another as a matrix
se@neighbors$RNA.nn@nn.idx[1:10, 1:10]
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
 [1,]    1 30306 23019  8903 32068 59309 14714 54437 13081 36150
 [2,]    2 44738 38727  6507 14448 32845 48606 43169 30367 31626
 [3,]    3 10378  2009 57301 50773 19587 56172 44510 26456 48092
 [4,]    4 44008 33059  4474 36589 33119  7758 48445 46259 44169
 [5,]    5 32484 38081 19971 47756 27292 43558 16928 26009  8331
 [6,]    6 27166 39915 24243 47154 17092 13643 56797 44224 13881
 [7,]    7 24072 52777 47682 27358 27987 54311 52625 53452 45804
 [8,]    8 45468 55070  8114 34366 12748 33649 12651 45972 10049
 [9,]    9 41110 43383 21549 16508 41384  8152 51537 46590 33528
[10,]   10  5878 42823 46767 10526 53954  8363 32476 30185 45876

The nearest neighbors graph is stored in the neighbors slot of the Seurat object as two matrices:

  • se@neighbors$RNA.nn@nn.dist; the distances between each cell (rows) and its closest k.param neighbors (columns)
  • se@neighbors$RNA.nn@nn.idx; the indices of which cells those closest k.param neighbors are

Because these are plain matrices, the object also stores a list of the cell names for index <-> cellname translation:

  • se@neighbors$RNA.nn@cell.names

se <- FindNeighbors(
            se, 
            k.param = 30
)

You can imagine that if we tried to display 30 lines from each cell connecting to other cells in a dataset of 60,000 cells, we would be drawing over 1.5 million lines! This isn’t computationally feasible on many computers, and it also just isn’t readable to the eye.

For educational purposes let’s visualize this graph on a subset of 1000 cells:

seSub <- subset(se, downsample = 1000) %>%
        NormalizeData() %>%
        FindVariableFeatures() %>%
        ScaleData() %>%
        RunPCA(npcs = 30)

seSub <- FindNeighbors(
    seSub,
    k.param = 30,
    return.neighbor = TRUE
              ) 

Here we use a helper function, process_and_graph_connectivity(), that takes a Seurat object with a kNN graph in se@neighbors$RNA.nn@nn.idx and returns a plot of the edges between cells in PCA space.
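The function body isn't printed in this worksheet, so here is a minimal sketch of what such a helper could look like. The name process_and_graph_connectivity matches the call below, but this body is our illustration, assuming the kNN indices live in se@neighbors$RNA.nn@nn.idx and the embedding in the pca reduction:

```r
# Build a table with one row per (cell, neighbor) edge, carrying the start and
# end coordinates in the first two embedding dimensions.
knn_edge_table <- function(emb, nn_idx) {
    k   <- ncol(nn_idx)
    idx <- as.vector(t(nn_idx))  # row-major: cell 1's neighbors, then cell 2's, ...
    data.frame(
        x    = rep(emb[, 1], each = k),
        y    = rep(emb[, 2], each = k),
        xend = emb[idx, 1],
        yend = emb[idx, 2]
    )
}

# Wrapper matching the call below: pull the PCA embedding and kNN indices from
# the Seurat object, then draw one segment per edge.
process_and_graph_connectivity <- function(se) {
    edges <- knn_edge_table(
        Seurat::Embeddings(se, "pca"),
        se@neighbors$RNA.nn@nn.idx
    )
    ggplot2::ggplot(edges, ggplot2::aes(x = x, y = y, xend = xend, yend = yend)) +
        ggplot2::geom_segment(alpha = 0.05) +
        ggplot2::geom_point(size = 0.3) +
        ggplot2::theme_classic()
}
```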

process_and_graph_connectivity(seSub) + 
    labs(title = "K=30 Cell Connectivity in PCA Space", x = "PC_1", y = "PC_2")

We can see here that most cells are connected to those in their immediate vicinity, but cells that are more extreme along the x and y axes are also connected to more distant cells. If we decrease our k.param, this will be less pronounced.

These nearest neighbors are based on the whole PCA latent space - the distances are decided based on the “euclidean distance” between cells in ALL of the dimensions of the PCA.
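To make this concrete, here is a base-R check (an illustration, not Seurat code) that the graph distance uses every dimension of the latent space, not just the two we plot:

```r
# Distance over all 30 dimensions vs. distance in just the 2 plotted PCs.
set.seed(687)
emb <- matrix(rnorm(3 * 30), nrow = 3)          # 3 cells in a 30-PC latent space
d_full <- sqrt(sum((emb[1, ] - emb[2, ])^2))    # manual euclidean distance
all.equal(d_full, as.matrix(dist(emb))[1, 2])   # TRUE: dist() uses all 30 dims
d_2d <- sqrt(sum((emb[1, 1:2] - emb[2, 1:2])^2))
d_2d < d_full   # cells can look close in 2D while being far in the full space
```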

seSub <- FindNeighbors(
    seSub,
    k.param = 5,
    return.neighbor = TRUE
    ) 

process_and_graph_connectivity(seSub) + 
    labs(title = "K=5 Cell Connectivity in PCA Space", x = "PC_1", y = "PC_2")

Clusters are then calculated based on the shared nearest neighbor (SNN) graph. In the SNN graph, the distance between two cells is replaced by a measure of how many nearest neighbors they share; by default Seurat weights each edge by the Jaccard overlap of the two cells' neighbor lists (the number of shared neighbors divided by the size of their union). The SNN graph makes the clustering more robust to outliers.
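As a tiny sketch of how an SNN edge weight is derived from two cells' neighbor lists (toy index values, illustrating the Jaccard weighting):

```r
# Toy neighbor index lists for two cells from a k = 5 kNN graph.
neighbors_a <- c(1, 4, 7, 9, 12)
neighbors_b <- c(2, 4, 9, 12, 15)

shared  <- length(intersect(neighbors_a, neighbors_b))        # 3 shared neighbors
jaccard <- shared / length(union(neighbors_a, neighbors_b))   # 3 / 7
c(shared = shared, jaccard = jaccard)
```

Cells that share many neighbors get a heavy edge; cells that happen to be mutual neighbors but share little context get a light one.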

Let’s see what happens to clusters when we modulate our k.param.

seSub <- FindNeighbors(
    seSub,
    k.param = 20
    ) 

seSub <- FindClusters(seSub)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 1000
Number of edges: 28464

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8402
Number of communities: 10
Elapsed time: 0 seconds
knn20 <- DimPlot(
    seSub,
    group.by = 'seurat_clusters',
    reduction = 'pca') +
    ggtitle('k.param = 20')

seSub <- FindNeighbors(
    seSub,
    k.param = 5
    ) 

seSub <- FindClusters(seSub)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 1000
Number of edges: 7617

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8869
Number of communities: 17
Elapsed time: 0 seconds
knn5 <- DimPlot(
    seSub, 
    group.by = 'seurat_clusters',
    reduction = 'pca') +
    ggtitle('k.param = 5')

(knn5 + knn20)

As we can see, with a higher k.param there are fewer clusters. This is because each neighborhood is larger and more interconnected when we ask for more neighbors. With a smaller k we get more, finer clusters, but in random sub-samples of the data those clusters will be less stable, because each neighborhood depends on exactly which cells are kept in; we might even get spurious clusters. Therefore, choosing k is a balance between stability and resolution on rare cell types.

celltype_pca <- DimPlot(
        seSub,
        reduction = "pca",
        group.by = 'Celltype'
        ) 

donor_pca <- DimPlot(
        seSub,
        reduction = "pca",
        group.by = 'donor_id'
        ) + scale_color_manual(values = donor_pal)

(knn5 + knn20) / (celltype_pca + donor_pca)

Even with a smaller k, we are finding multiple clusters within the authors’ overarching “classical monocyte” label. However, we see a reasonable distribution of donors across them.

If we make an Alluvial plot, we can see the differences between our clusters and the author labels

se <- se %>%
    FindNeighbors(k.param = 5) %>%
    FindClusters(resolution = 0.05) %>% # let's use a very small resolution for simplicity's sake
    RunUMAP(
        reduction = 'pca',
        dims = 1:30,
        reduction.name = 'pca_umap')
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 601333

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9831
Number of communities: 8
Elapsed time: 2 seconds
metadata <- se@meta.data
plot_data <- as.data.frame(table(metadata$seurat_clusters, metadata$Celltype))
colnames(plot_data) <- c("Cluster", "Celltype", "Freq")

ggplot(plot_data,
       aes(axis1 = Cluster, axis2 = Celltype, y = Freq)) +
    geom_alluvium(aes(fill = Cluster), width = 0.1) +
    geom_stratum(width = 0.1) +
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_x_discrete(limits = c("Cluster", "Cell Type"), expand = c(0.15, 0.05)) +
    labs(title = "Alluvial Plot from Seurat Clusters to Cell Types",
         x = "Clusters and Cell Types",
         y = "Count") +
    theme_minimal()

Our low resolution clustering has captured the major cell types: cluster 0 appears to be myeloid cells, 1 is T cells, 2 is NK, 3 is B, 4 is platelets, and 6 and 7 are RBCs and one of the authors' Uncategorized clusters. Cluster 5 looks like a secondary myeloid cell type.

I like to immediately plot the distribution of samples per celltype to make sure we are getting useful clusters in our data.

se@meta.data %>%
    count(donor_id, seurat_clusters) %>% 
    group_by(seurat_clusters) %>%
    mutate(composition = n/sum(n)) %>%
    ggplot(aes(x = seurat_clusters, y = composition, fill = donor_id)) + 
        geom_bar(position = 'stack', stat = 'identity') + 
        scale_fill_manual(values = donor_pal)

So we have a good distribution of samples across our normal celltypes, and we probably got all our RBCs from a single sick patient’s bloody nose. In cluster 6, we have cells mostly from sick patients - these are probably some kind of activated cells.

We can get the best idea of what’s going on there with a differential expression, but let’s save that for later.

Assessing Harmony integration

Switching gears a little bit before we get ahead of ourselves making clusters, let’s double check we don’t have any batch effects at play that may cause batch-specific clustering.

Usually our batch effects will show up as some vector of variation across a metadata variable. This means they will usually show up in our PCA as a PC, so we can check for them by correlating our PCs with our metadata!

plot_batch_effect_heatmap <- function(
                                     seurat_object, 
                                     latent_space = 'pca', 
                                     num_pcs = 20, 
                                     metadata_columns_to_include = NULL,
                                     plot_title = "Correlation Heatmap of Metadata and PCA Components") {
    metadata <- seurat_object@meta.data
    metadata <- metadata[, metadata_columns_to_include, drop = FALSE]
    pca_data <- Embeddings(seurat_object, latent_space)[, 1:num_pcs] # pull the PCs we want to check

    categorical_cols <- sapply(metadata, function(col) is.factor(col) || is.character(col))
    
    # convert character columns to factors for one-hot encoding
    metadata[categorical_cols] <- lapply(metadata[categorical_cols], as.factor)

    # one-hot encode categorical columns without removing redundant dummies
    metadata_encoded <- fastDummies::dummy_cols(
        metadata,
        remove_first_dummy = FALSE,
        remove_selected_columns = TRUE)

    combined_data <- cbind(metadata_encoded, pca_data) # combine metadata and PCA data
    
    # calculate the correlation matrix
    correlation_matrix <- cor(combined_data, use = "pairwise.complete.obs")

    # Subset the correlation matrix to include only metadata and PCA components
    metadata_columns <- colnames(metadata_encoded)
    pca_columns <- colnames(pca_data)
    subset_correlation_matrix <- correlation_matrix[metadata_columns, pca_columns, drop = FALSE]
    
    # Check for any NA values in the subset correlation matrix and replace them with 0
    subset_correlation_matrix[is.na(subset_correlation_matrix)] <- 0
    
    # Create the heatmap using corrplot
    corrplot(subset_correlation_matrix, method = "circle", tl.col = "black", tl.srt = 45,
             number.cex = 0.7, tl.cex = 0.7,
             main = plot_title)
}

plot_batch_effect_heatmap(
    se, 
    metadata_columns_to_include = c('Age', 'sex', "donor_id", "Number of UMI",
                                    "disease", "Severity"))

We don’t really have any metadata variables that correlate strongly with our PCs, so it seems we’ve found a great dataset without much batch effect! Of all the PCs here, PC8 shows the strongest correlations with covariates we probably don’t want driving our conclusions - it is highly correlated with a single patient, which likely causes spurious correlations with that patient's other metadata. So, for educational purposes, let's see what happens when we integrate over donor_id to try to remove this effect.

Integration with Harmony

se <- se %>% 
    RunHarmony(
        group.by.vars = 'donor_id',
        reduction.use = 'pca',
    ) %>%
    FindNeighbors(reduction = 'harmony', k.param =5) %>%
    FindClusters(resolution = 0.05)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 59572
Number of edges: 622173

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9829
Number of communities: 5
Elapsed time: 2 seconds
se <- RunUMAP(se, reduction = "harmony", dims = 1:30, reduction.name = "harmony_umap")

h1_2 <- DimPlot(se,
        reduction = 'harmony',
        group.by = 'donor_id',
        dims = c(1,2)
        ) + 
        scale_color_manual(values = donor_pal) +
        NoLegend()

h1_8 <- DimPlot(se,
        reduction = 'harmony',
        group.by = 'donor_id',
        dims = c(1,8)
        ) + NoLegend() +
        scale_color_manual(values = donor_pal)

pcs1_2 <- DimPlot(se,
        reduction = 'pca',
        group.by = 'donor_id') + 
        scale_color_manual(values = donor_pal) +
        NoLegend()

pcs1_8 <- DimPlot(se,
        reduction = 'pca',
        group.by = 'donor_id',
         dims = c(1,8)) + 
        scale_color_manual(values = donor_pal) 

(pcs1_2 + pcs1_8) / (h1_2 + h1_8)

By comparing the latent spaces directly, we can already see the Flu 1 effect has disappeared from dimension 8. Next, we should check the effects of this change on the correlation structure of our whole dataset. We can do this by plotting the correlation of metadata covariates again:

plot_batch_effect_heatmap(
    se, 
    latent_space = 'harmony',
    metadata_columns_to_include = c('Age', 'sex', "donor_id", "Number of UMI",
                                    "disease", "Severity"),
    plot_title = "Correlation Heatmap of Metadata and Harmony latent variables")

Finally, let's compare the UMAPs computed from the PCA and Harmony latent spaces, colored by donor and cell type, to see the integration's effect.

plt1 <- DimPlot(
    se,
    reduction = 'pca_umap',
    group.by = c('donor_id')) + 
    scale_color_manual(values = donor_pal) | DimPlot(
        se,
        reduction = 'pca_umap',
        group.by = c('Celltype'))

plt2 <- DimPlot(
    se,
    reduction = 'harmony_umap',
    group.by = c('donor_id')) + 
    scale_color_manual(values = donor_pal) | DimPlot(
        se,
        reduction = 'harmony_umap',
        group.by = c('Celltype'))

plt1 / plt2

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fastDummies_1.7.3    corrplot_0.92        ggalluvial_0.12.5    ggforce_0.4.1        colorRamps_2.3.4     RColorBrewer_1.1-3   tidygraph_1.3.1      ggraph_2.1.0         scales_1.3.0         DT_0.33              colorBlindness_0.1.9 harmony_0.1.1        Rcpp_1.0.12          devtools_2.4.5       usethis_2.2.2        lubridate_1.9.2      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.4          tidyr_1.3.1          tibble_3.2.1         ggplot2_3.5.0        tidyverse_2.0.0      Seurat_5.1.0         SeuratObject_5.0.2   sp_2.1-3            

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            polyclip_1.10-6        lifecycle_1.0.4        globals_0.16.2         processx_3.8.2         lattice_0.21-8         MASS_7.3-60            crosstalk_1.2.1        magrittr_2.0.3         plotly_4.10.4          sass_0.4.8             rmarkdown_2.26         jquerylib_0.1.4        yaml_2.3.8             remotes_2.5.0          httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            sessioninfo_1.2.2      pkgbuild_1.4.2         spatstat.sparse_3.0-3  reticulate_1.35.0.9000 cowplot_1.1.3          pbapply_1.7-2          abind_1.4-5            pkgload_1.3.2.1        Rtsne_0.17             tweenr_2.0.2           ggrepel_0.9.5          irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-2  fitdistrplus_1.1-11    parallelly_1.37.0      leiden_0.4.3.1         codetools_0.2-19       tidyselect_1.2.0       farver_2.1.1           viridis_0.6.4          matrixStats_1.2.0      spatstat.explore_3.2-6 jsonlite_1.8.8         ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6         survival_3.5-7        
 [52] tools_4.3.1            ica_1.0-3              glue_1.8.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            fastmap_1.1.1          fansi_1.0.6            callr_3.7.3            digest_0.6.34          timechange_0.2.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.4      prettyunits_1.1.1      graphlayouts_1.0.0     httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          htmltools_0.5.7        profvis_0.3.8          dotCall64_1.1-1        png_0.1-8              knitr_1.45             rstudioapi_0.16.0      tzdb_0.4.0             reshape2_1.4.4         nlme_3.1-163           cachem_1.0.8           zoo_1.8-12             KernSmooth_2.23-22     parallel_4.3.1         miniUI_0.1.1.1         pillar_1.9.0           grid_4.3.1             vctrs_0.6.5            RANN_2.6.1             urlchecker_1.0.1       promises_1.2.1         xtable_1.8-4           cluster_2.1.4         
[103] evaluate_0.23          cli_3.6.3              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         ps_1.7.5               plyr_1.8.9             fs_1.6.3               stringi_1.8.3          viridisLite_0.4.2      deldir_2.0-2           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-8    Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2           memoise_2.0.1          bslib_0.6.1